Rolling functions with missing values

Motivation

This chapter expands Chapter 5 by using the same methods from Chapter 5. These examples were adapted from @vaughan.

Load the Package

I loaded the ece244 package as I added the functions from the next sections to it with the following code:

load_all()

Additional Packages

I used the bench package to compare the performance of the functions. The package was loaded with the following code:

library(bench)

Cumulative sum (cumsum())

The next function returns the cumulative sum of the elements of a vector:

[[cpp11::register]] doubles cumsum2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = y1 + 0.0;
      } else {
        if (ISNAN(y1)) {
          out[i] = 0.0 + y2;
        } else {
          out[i] = y1 + y2;
        }
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = y1 + y2;
        }
      }
    }
  }

  return out;
}

Unlike the cumsum() function from Chapter 5, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the cumulative sum of the coordinates of a vector (C++)
#' @param x numeric vector
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
cumsum2_cpp <- function(x, na_rm = FALSE) {
  cumsum2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

set.seed(123) # for reproducibility
x <- rpois(1e6, lambda = 2) # 1,000,000 elements

cumsum(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumsum2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumsum2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 3 7
mark(
  cumsum(x),
  cumsum2_cpp(x)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cumsum(x)        1.71ms   2.05ms     424.     3.81MB     52.7
2 cumsum2_cpp(x)  32.62ms  36.84ms      27.7   15.26MB     49.8

Cumulative product (cumprod())

The next function calculates the cumulative product of the elements of a vector:

[[cpp11::register]] doubles cumprod2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = y1 * 1.0;
      } else {
        if (ISNAN(y1)) {
          out[i] = 1.0 * y2;
        } else {
          out[i] = y1 * y2;
        }
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = y1 * y2;
        }
      }
    }
  }

  return out;
}

The corresponding auxiliary function for documentation is:

#' Return the cumulative product of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
cumprod2_cpp <- function(x, na_rm = FALSE) {
  cumprod2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

cumprod(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumprod2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cumprod2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 2 8
mark(
  cumprod(x),
  cumprod2_cpp(x)
)
# A tibble: 2 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cumprod(x)        1.81ms   2.08ms     328.     15.3MB    448. 
2 cumprod2_cpp(x)  31.41ms   33.7ms      29.8    15.3MB     26.1

Cumulative minimum (cummin())

The next function calculates the cumulative minimum of the elements of a vector:

[[cpp11::register]] doubles cummin2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = x[i - 1], y2 = x[i];
      if (ISNAN(y1)) {
        out[i] = y2;
      } else {
        out[i] = std::min(y1, y2);
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = std::min(y1, y2);
        }
      }
    }
  }

  return out;
}

The corresponding auxiliary function for documentation is:

#' Return the cumulative minimum of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
cummin2_cpp <- function(x, na_rm = FALSE) {
  cummin2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

cummin(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummin2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummin2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 1 1
mark(
  cummin(x),
  cummin2_cpp(x)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cummin(x)        1.06ms   1.19ms     819.     3.81MB     85.6
2 cummin2_cpp(x)  31.79ms  32.16ms      31.1   15.26MB     14.1

Cumulative maximum (cummax())

The next function calculates the cumulative maximum of the elements of a vector:

[[cpp11::register]] doubles cummax2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  writable::doubles out(n);
  out[0] = x[0];

  if (na_rm == true) {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y1)) {
        out[i] = y2;
      } else {
        out[i] = std::max(y1, y2);
      }
    }
  } else {
    for (int i = 1; i < n; ++i) {
      double y1 = out[i - 1], y2 = x[i];
      if (ISNAN(y2)) {
        out[i] = NA_REAL;
      } else {
        if (ISNAN(y1)) {
          out[i] = NA_REAL;
        } else {
          out[i] = std::max(y1, y2);
        }
      }
    }
  }

  return out;
}

The corresponding auxiliary function for documentation is:

#' Return the cumulative maximum of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
cummax2_cpp <- function(x, na_rm = FALSE) {
  cummax2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

cummax(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummax2_cpp(c(1, NA, 2, 4))
[1]  1 NA NA NA
cummax2_cpp(c(1, NA, 2, 4), na_rm = TRUE)
[1] 1 1 2 4
mark(
  cummax(x),
  cummax2_cpp(x)
)
# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 cummax(x)        1.01ms   1.11ms     741.     3.81MB     65.1
2 cummax2_cpp(x)  31.11ms  32.01ms      30.7   15.26MB     23.9

References